Search Result


ODTrans: Fault Tolerant Transaction Protocols for the Cloud Data Store

CHENG Xu;LI Hongyan;WANG Tengjiao;YANG Dongqing

Acta Scientiarum Naturalium Universitatis Pekinensis DOI: 10.13209/j.0479-8023.2015.011


EmBIOS: A BIOS Design for Embedded System Supporting MS Windows

LI Hao,ZHENG Yansong,PANG Jiufeng,TONG Dong,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

The authors present EmBIOS, a compatible BIOS design that lets embedded systems support desktop operating systems such as MS Windows. To achieve OS compatibility effectively, a simulator BIOS that can boot a desktop OS in a simulator environment is divided into multiple interrupt service routines. By extending these interrupt service routines and transplanting them to a traditional embedded firmware environment, EmBIOS enables initialization of the embedded system with existing firmware and provides the BIOS compatibility required by a desktop OS. Functional correctness and OS compatibility are verified by running Windows and its typical applications on PKUnity86 FPGA and silicon. Experimental results demonstrate the portability of the EmBIOS design and its acceptable boot-up performance compared with a commercial embedded BIOS.


Application-Specific Graphical Caching Scheme for Thin-Client Computing

ZHANG Yang,GUAN Xuetao,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

The authors investigate the raw pixel redundancy caused by repainting an application’s graphical objects and propose an application-specific graphical caching scheme to recognize and reduce this class of redundancy. The effectiveness of the scheme is demonstrated by an implementation in VNC, a frame-buffer-based thin-client system. The experimental results show that, for the tested scenarios, the scheme could reduce network traffic by about 17.8%-22.7% and eliminate most of the high latencies caused by screen redundancy. Meanwhile, the scheme costs only a little additional computation and memory.


Extending Virtual Machine Memory with Hypervisor Exclusive Cache

NIU Yan,YANG Chun,XIA Yubin,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

It is hard to accurately predict the memory demand of a virtual machine. Moreover, it is not reliable to request other virtual machines to release memory. Under-provisioning of memory leads to severe performance degradation. To mitigate the impact, a hypervisor exclusive cache (HECache) is developed to extend the available memory of a virtual machine. A certain amount of memory is reserved as HECache in advance. A failed memory access in the VM is forwarded to HECache. All virtual machines running on the physical machine share HECache and can use it immediately. By donating a little memory, all virtual machines can use more memory. Experiments conducted with both micro-benchmarks and real applications show that HECache can achieve up to 7.9 times better performance, and that the overhead is not significant compared with allocating the same amount of memory directly to a virtual machine. In addition, HECache is transparent to applications and is complementary to existing techniques such as ballooning, page sharing and hotplug.


A Comprehensive Study of Executing ahead Mechanism for In-Order Microprocessors

WANG Xiaoyin,TONG Dong,DANG Xianglei,LU Junlin,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

The authors explore the design space of in-order executing ahead processors, and conduct sensitivity analysis of the executing ahead mechanism to the cache hierarchy and memory latency. It is demonstrated that reusing the pre-executed results is highly effective in improving performance and reducing energy consumption. The results also show that propagating valid data values between stores and dependent loads with a small store cache increases performance significantly. An in-order executing ahead processor with a 32-entry store cache and a 128-entry FIFO for preserving and reusing results increases performance by 24.07% over the baseline processor, with an energy overhead of 4.93%. Furthermore, it is revealed that executing ahead is necessary for hiding memory access latencies even with a very large cache hierarchy. With increasing memory latency, the performance and energy-efficiency benefits provided by executing ahead are more significant.


Standard-Cell-Based Temperature Sensor with Calibrated Supply Noise Tolerance

TIE Meng,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

A standard-cell-based temperature sensor with calibrated tolerance for supply shift is proposed. Traditional digital-circuit temperature sensors suffer large errors caused by supply voltage shift, since they are sensitive to supply voltage. Because the sensor uses only standard cells, it is easy to design with a normal digital circuit design flow. After 2-voltage calibration, the error caused by a supply shift of almost 0.1 V is reduced to 28.5℃, compared with 90℃ for the previously proposed dual-ring sensor.


A Basic-Block Reordering Algorithm Based on Neural Networks

ZHANG Jiyu,LIU Xianhua,LIANG Kun,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

The authors present a basic-block reordering method that detects typical structures in the control-flow graph. It uses an architecture-specific branch cost model and the execution possibilities of control-flow edges to estimate the layout costs of specific sub-structures, and chooses the layout with the minimal estimated cost. The authors further investigate a novel approach that applies a neural network to predict the execution possibility of each edge. A set of programs is chosen to record particular static information about the edges in the typical structures. The data capture the relationship between static program features and dynamic behaviors, and are used to train an improved back-propagation neural network (RPROP). The algorithm is implemented for a simple-pipeline UniCore microprocessor. Experimental results show that it improves program performance by about 8%, which indicates that the execution possibility of edges may be predicted using machine learning techniques.


Microarchitectural Design Space Exploration via Support Vector Machine

PANG Jiufeng,LI Xianfeng,XIE Jinsong,TONG Dong,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

The authors propose an approach that reduces the number of required simulations: it simulates only sampled design points and uses the results to construct informative and predictive support vector regression models. Having captured the interacting effects of design parameters, the models predict outputs for design points that are not simulated. The prediction time of a model is negligible compared with detailed simulation. The optimal design point determined by prediction is very close to that found by simulation for most applications, which provides an efficient way to cull a huge design space. Trained on only 0.26% of the design points, the models yield a mean relative prediction error as low as 0.52% for performance and 1.08% for power. Correlation analysis demonstrates that the predicted output is highly correlated with the simulated observation. The average squared correlation coefficient is 0.728 for performance models and 0.703 for power models, which implies that support vector regression captures most of the relationships among design parameters. The model also provides a predictive probability interval for each prediction, which is informative for computer architects.


Improvement of the Interactive Performance Isolation of Virtual Machines on Xen Platform

XIA Yubin,YANG Chun,NIU Yan,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

The authors address the problem that, in a highly consolidated environment, the network latency of a guest OS shows continuous peaks. Three optimizations of the VM scheduler are designed and implemented to improve interactive performance isolation: cooperative preemption, preempt-back and accurate accounting. None of these optimizations requires the guest OS to be modified. The evaluation results show that, with 8 computing-intensive VMs running concurrently, the average of the top 5% network latency of the other 8 VMs is reduced to as low as 0.93% of the original, and that of web-mail browsing with Firefox is reduced to 56.1%.


Analysis and Practice of a SoC Hardware Kernel for MS Windows

ZHENG Yansong,TONG Dong,LI Hao,PANG Jiufeng,WANG Keyi,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

The authors study a method for developing a SoC hardware kernel for MS Windows. The method captures the basic system function specification of the hardware kernel through multiple simulation executions and gradual drawoff, on the premise that the system remains MS Windows compatible. The experiment indicates that the hardware kernel is drastically simpler than the whole system, and that the hardware kernel requirements differ markedly among MS Windows versions. Moreover, the SoC hardware kernel for MS Windows 98 is verified on an FPGA prototype.


CacheCompress: A Novel Approach for Test Compression for IP Cores

FANG Hao,SONG Xiaodi,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

A novel test data compression technique named CacheCompress is proposed. Unlike previous techniques based on a static dictionary, its dictionary is dynamic. During testing, the dictionary is accessed by read and write operations and only needs to keep the most frequently used data, which greatly decreases the memory size requirement and eliminates the explicit dictionary initialization step. Experiments show that CacheCompress achieves a 30% higher compression ratio than other recent compression schemes, while the dictionary size is dramatically reduced to 1‰.


RiTLB: iTLB Design Based on Memory Region Reusing

XIE Jinsong,TONG Dong,LI Xianfeng,PANG Jiufeng,WANG Keyi,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

To design an iTLB based on memory region reusing, the number of comparison bits in a lookup is first reduced through memory region encoding, which replaces the higher-order bits of the VPN with a much shorter memory region ID before the VPN is sent to the iTLB. Second, the memory region ID is reused until execution switches to the next memory region. Experimental results show that, compared with the baseline iTLB, the average dynamic power, delay and area of the new design decrease by 62.84%, 9.96% and 44.78% respectively, with only a 0.23% average IPC reduction.


A Profit-Driven Algorithm for Semantic Code Motion

NIE Jiutao,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

The potential reasons for the negative effects of aggressive code motion are analyzed. The authors build a profit model and propose a profit-driven semantic code motion algorithm, which determines whether an existing result should be reused. The new algorithm is implemented in GCC-4.2.0. Experimental results from SPEC2000 on an X86 machine show that the code generated by GCC using the new algorithm is on average 6.8% faster than that using semantic code motion and 2.6% faster than that using GCC's original code motion algorithm, GVNPRE.


Maximum Power Analysis Based on Bayesian Inference and Vector Compression Techniques

CHEN Jie,LI Xianfeng,TONG Dong,WANG Keyi,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

To address the excessive simulation time of simulation-based maximum power analysis, a Bayesian power model based on slice analysis is proposed. This model selects the input vector set that may generate maximum power and performs accurate power estimation on the compacted sequence. The relationship between signal switching density and maximum power generation is analyzed, and an input vector generation platform with self-adaptive switching density computation and Bayesian vector compression is then proposed. Experimental results indicate that the Bayesian vector compression method speeds up estimation by a factor of 1005 on average, with an average maximum-power error of 2.40%. When vector generation based on self-adaptive computation and Bayesian vector compression is used, the lower bound on maximum power is increased by 1.99%, and the average speed-up reaches a factor of 163.


Clock Skew Scheduling for Area Optimization

WANG Kui,DONG Haiying,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

A new clock skew scheduling algorithm is proposed. This algorithm generates timing constraints that effectively promote area optimization in logic synthesis. During clock skew scheduling, the slacks are not assigned equally to the arcs in critical cycles. Instead, they are assigned according to arc weights, which are calculated by considering the area impact of the corresponding paths. Experimental results show that this approach efficiently reduces the area of logic synthesis results compared with the traditional clock skew scheduling algorithm, without degrading performance.


An Arbitration Approach of Efficient Bandwidth Allocation and Low Latency for SoC Communication

LU Junlin,LIU Dan,TONG Dong,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

A novel arbitration approach for SoC communication is presented. It provides fine-grained bandwidth allocation based on dynamically updated records of communication status. The arbiters in NoC routers, multi-port DRAM controllers and shared buses can adopt this approach to improve system performance. The proposed approach is evaluated with a metric called bandwidth shortage, which reflects how close the actual bandwidth allocation is to an optimal one. Experimental results reveal that this arbitration approach can reduce the bandwidth shortage by 13% and shorten the communication latency by 37.5%. Furthermore, the results of a hardware implementation show that it is efficient in area and timing for large- and medium-scale SoC designs.


A Fast Hierarchical Multi-Objective Mapping Approach for Mesh-Based Networks-on-Chip

LIN Hua,ZHANG Liang,TONG Dong,LI Xianfeng,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

The authors propose a fast hierarchical multi-objective mapping approach (HMMap) for mesh-based NoC. Based on partitioning and multi-objective heuristic techniques, HMMap automatically maps a large number of IP cores onto the NoC architecture and makes good tradeoffs between communication energy and latency. Experimental results show that the proposed approach achieves shorter execution time, lower energy and lower latency compared with others. As the NoC size increases, the optimization effect of HMMap becomes more obvious.


An Age Encoding Based Bloom Filter Algorithm for Load-Store Queue Energy Reduction

ZHAO Yulai,TONG Dong,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

The load-store order violations and load-load order violations are considered in multithreaded or multiprocessor systems, and the counter-based bloom filter algorithm is improved by eliminating false positives through age encoding. The filtering ratio is improved by over 5% with no impacts on pipeline timing or performance.
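
For orientation, here is a minimal sketch of a plain counter-based Bloom filter, the baseline structure that the abstract says is improved. The age-encoding extension is not shown, because the abstract does not describe its details. The class name, the sizes and the use of SHA-256 as a hash are illustrative assumptions, not the authors' hardware design.

```python
import hashlib


class CountingBloomFilter:
    """Hypothetical counter-based Bloom filter keyed by memory address."""

    def __init__(self, num_counters=64, num_hashes=2):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _indexes(self, address):
        # Derive num_hashes counter indexes from the address (assumed hash).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{address}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % len(self.counters)

    def insert(self, address):
        # Called when an in-flight memory operation enters the queue.
        for idx in self._indexes(address):
            self.counters[idx] += 1

    def remove(self, address):
        # Called when that operation leaves; only valid after insert().
        for idx in self._indexes(address):
            self.counters[idx] -= 1

    def may_contain(self, address):
        # False means "definitely absent", so the queue search can be skipped.
        # True may be a false positive caused by counter sharing.
        return all(self.counters[idx] > 0 for idx in self._indexes(address))
```

A negative answer from `may_contain` is always correct, which is what lets a filter of this kind save energy by suppressing associative queue searches; false positives only cost an unnecessary search.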


SSDC: A Split Data Cache Design for Sequential Access Intensive Applications

LIU Shu,GOU Xiaogang,QU Ning,LI Xianfeng,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

Caches are widely used to reduce the speed gap between processors and memories. However, the spatial locality of sequential data accesses existing in many popular applications is not well exploited by conventional data cache. In response to these problems, the Split Sequential Data Cache (SSDC) is proposed, in which the sequential access detector can predict whether data accesses are sequential, and direct them to the right sub cache. Experiments show that the SSDC outperforms the conventional data cache and other schemes. It reduces the miss rate of applications with intensive sequential data accesses with only a little increment of bandwidth requirement. Meanwhile, the experimental results on SPEC2000Int show that SSDC does not hurt the performance of applications without large sequential accesses.


A Low-Leakage Pipelined Instruction Cache Design

SUN Hanxin,WANG Xiaoyin,TONG Dong,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

The pipelined level-one instruction cache (PIL1) has been proposed to improve instruction fetch bandwidth in high-frequency processors. However, few studies in the literature have focused on reducing the leakage power of PIL1. Here, the authors observe that the PIL1 structure naturally offers inherent opportunities for saving leakage power. Based on this observation, the authors propose to manage cache line activity according to the demand of the fetch address: only the requested line is activated and the others are kept in low-voltage mode, thereby saving leakage power effectively. Simulation results demonstrate that the PIL1 leakage power is reduced by an average of 77.3%. Meanwhile, the performance degradation is only 0.32% and no timing overhead is induced.


A Semi-Centralized Computing Model for Network Computer Systems

YANG Chun,XIA Yubin,NIU Yan,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

A semi-centralized computing model is proposed for network computer systems. It enables clients to take on more of the computing workload and provides a seamless operating experience for users, while preserving all the management advantages of traditional network computer systems. The authors investigate strategies for computing partition, input integration and display integration, and implement a prototype video player based on the semi-centralized computing model. Experimental results show that it can provide seamless video playback and reduce server load dramatically.


Power-Aware Gated Clock Routing with Merging Cost Backward Annotation Using Simulated Annealing Method

DUAN Lian,XU Hu,WANG Kui,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

Traditional clock routing algorithms can be extended to embrace clock gating by merging the node pairs with minimum switching capacitance in the bottom-up phase. However, optimizing switching capacitance at the nodes currently being merged affects their ancestors' gating chances, which may increase power consumption. A zero-skew gated clock routing algorithm is proposed to solve this problem. It reduces the total switching capacitance by evaluating the merging cost of this effect using the result derived from the clock tree generated in the previous round. Because the result needs to be optimized over iterations, the algorithm employs a simulated annealing technique. At each iteration, the clock tree is reconstructed using back-annotated merging cost information, and new constraints are generated for optimization in the next round. Experimental results show that this algorithm can achieve up to 23% power reduction compared with the traditional Greedy-DME algorithm.


Hierarchical Network-on-Chip Design Method

WANG Hongwei,LU Junlin,TONG Dong,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

With the development of VLSI technology and the increasing complexity of System-on-Chip applications, on-chip communication architecture design encounters problems in throughput, power, signal integrity, latency and clock synchronization; Network-on-Chip (NoC) was introduced to address them. Given the specific patterns of on-chip communication, it is of great significance to design a hierarchical Network-on-Chip to improve communication performance and reduce hardware cost. This paper puts forward a hierarchical NoC design method. According to the technology and application requirements, researchers can generate several IP core subsets (“clusters”) and design a NoC architecture according to the inter-cluster communication requirements. Experiments show that the hierarchical NoC design method can improve system performance efficiently, decrease hardware cost, and meet Quality-of-Service requirements at the same time.


CMOS Combinational Circuit Leakage Power Reduction Using Genetic Algorit

ZHAO Xiaoying,YI Jiangfang,TONG Dong,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

A leakage power reduction platform for CMOS combinational circuits based on input vector control is presented. A genetic algorithm is used to search for the minimum leakage vector, with the circuit status difference as the fitness function. Experimental results show that this genetic algorithm based on circuit status difference achieves satisfactory leakage power reduction with reasonable runtime. The method does not require HSpice simulation and is independent of the target technology library.


Characterizing the d-TLB Behavior of Typical Applications on Network Computer

QU Ning,YUAN Peng,GUAN Xuetao,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

A network computer is an interactive device in a thin-client computing environment, and studying the behavior of typical applications on this platform is important for microprocessor design and system development. Based on the PKUnity network computer platform, this paper analyzes the d-TLB miss rate and performance penalty of many typical applications under different d-TLB structures and page sizes. The experimental results explain the advantage of the TLB design in the PKUnity SoC, which satisfies the requirements of low power and low complexity.


GATEST: A Validation Platform of Automatic Simulation Vectors Generation Using Genetic Algorithms

YI Jiangfang,TONG Dong,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

Simulation-based validation approaches need a large number of simulation vectors to verify the corner cases of VLSI designs. The authors developed a validation platform for RT-level designs that automatically generates simulation vectors with a genetic algorithm based on the path coverage metric. Given the critical signals, it uses data flow analysis to acquire the critical path set and takes critical path coverage as the fitness function of the GA. The authors performed experiments on some functional modules of the Unity-863 SoC. The relationship between the final results and the control factors was also analyzed in detail. The results show that GATEST is effective and efficient.


Design Features of a High Throughput RSA Cryptoprocessor

LIU Qiang,MA Fangzhen,TONG Dong,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

Montgomery multiplication algorithm is optimized for large-bit modular multiplication and VLSI implementation. It is combined with the R-L (Right to Left) binary method to achieve speed improvement. Special efforts are focused on the problems with long-bit modular arithmetic. A Carry-Save-Adder architecture, which is implemented by redesigned (4:2) compressors, is used in the multiplier to avoid the long carry propagation. A signal-backup strategy is used to resolve the problem of signal broadcasting. Using a multiplexer-based method, the datapath of the multiplier is reconfigurable to perform either one 1024-bit-multiplication or two 512-bit multiplications in parallel. The Chinese Remainder Theorem (CRT) increases the decryption data rate by a factor of 3.8.
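
The abstract above combines Montgomery multiplication with the right-to-left binary method. A minimal software sketch of that combination follows, assuming an odd modulus n < 2^k and R = 2^k. The function names are illustrative and not taken from the paper, and the hardware aspects (carry-save adders, (4:2) compressors, the reconfigurable datapath) are not modeled. It needs Python 3.8 or later for `pow(n, -1, r)`.

```python
def montgomery_setup(n, k):
    """Precompute constants for an odd modulus n < 2**k, with R = 2**k."""
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r  # satisfies n * n_prime ≡ -1 (mod R)
    r2 = (r * r) % n                # used to convert into Montgomery form
    return n_prime, r2


def mont_mul(a, b, n, k, n_prime):
    """Return a * b * R^-1 mod n for 0 <= a, b < n, without dividing by n."""
    mask = (1 << k) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask  # makes t + m*n divisible by R
    u = (t + m * n) >> k               # exact division by R, result < 2n
    return u - n if u >= n else u


def mont_pow_rl(base, exp, n, k):
    """Modular exponentiation, scanning exponent bits from right to left."""
    n_prime, r2 = montgomery_setup(n, k)
    x = mont_mul(base % n, r2, n, k, n_prime)  # base in Montgomery form
    acc = mont_mul(1, r2, n, k, n_prime)       # 1 in Montgomery form
    while exp > 0:
        if exp & 1:
            acc = mont_mul(acc, x, n, k, n_prime)
        x = mont_mul(x, x, n, k, n_prime)      # square for the next bit
        exp >>= 1
    return mont_mul(acc, 1, n, k, n_prime)     # convert back out
```

In the right-to-left method the squaring of `x` and the conditional multiply into `acc` in one iteration do not depend on each other, which is one reason the method suits parallel multipliers.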


The Effect of Periodic Disturbance to the Hierarchical Structure in Turbulent Boundary Layer

CHENG Xueling,HU Fei

Acta Scientiarum Naturalium Universitatis Pekinensis

Artificial periodic disturbances are introduced to the outer field of a turbulent boundary layer in a closed-circuit open water channel. A statistical method is employed to analyze the velocity-fluctuation time series. The effect of the disturbance on the turbulent structure in the boundary layer is studied. The result indicates that the She-Leveque hierarchical similarity exists in high-frequency turbulence.


RSA Cryptoprocessor Based on a Redesigned Systolic Array

LIU Qiang,MA Fangzhen,TONG Dong,CHENG Xu

Acta Scientiarum Naturalium Universitatis Pekinensis

A novel and generic approach is presented for the hardware implementation of an RSA cryptoprocessor in deep submicron (DSM) technology with a redesigned systolic array. With deep submicron technology scaling, the integrated circuit performance bottleneck has shifted from logic gates to global interconnection. Besides using the systolic architecture, which is popular in hardware-based RSA systems, a block-based scheme is proposed to eliminate global signals, with a pipelined bus to convey data globally. The control signals and intermediate results used for sequential multiplications are transmitted by shift registers. All signals, except for the clock signal, are confined to one block or to two adjacent blocks. The Chinese Remainder Theorem (CRT) technique increases the decryption data rate by a factor of four. Two redundant blocks are added to adapt to the online partition of the multiplier and to the variation in the lengths of P and Q in CRT mode. The block-based global signal transportation scheme and the redundancy scheme are quite different from those of previous works.
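
The CRT speed-up mentioned above can be illustrated in software. The sketch below shows standard RSA decryption with the Chinese Remainder Theorem (Garner's recombination); the function name and the toy key are assumptions for illustration only, and nothing here models the systolic array or the block-based signalling. It needs Python 3.8 or later for `pow(q, -1, p)`.

```python
def rsa_decrypt_crt(c, p, q, d):
    """Decrypt ciphertext c with private exponent d and prime factors p, q."""
    dp = d % (p - 1)        # reduced exponent for the mod-p half
    dq = d % (q - 1)        # reduced exponent for the mod-q half
    q_inv = pow(q, -1, p)   # q^-1 mod p, precomputable per key

    m1 = pow(c % p, dp, p)  # half-length exponentiation mod p
    m2 = pow(c % q, dq, q)  # half-length exponentiation mod q

    # Garner's formula: combine the two residues into m mod (p * q).
    h = (q_inv * (m1 - m2)) % p
    return m2 + h * q


# Toy key for illustration only: n = 61 * 53 = 3233, e = 17, d = 2753.
ciphertext = pow(65, 17, 3233)
plaintext = rsa_decrypt_crt(ciphertext, 61, 53, 2753)
```

Each of the two exponentiations uses a modulus and an exponent of about half the length. With multiplication cost growing roughly quadratically in operand length, each costs about one eighth of the full exponentiation, so running both takes about a quarter of the work, which is consistent with the factor of four quoted in the abstract.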
